Lecture 01: Introduction and Foundations

PUBH 8878, Statistical Genetics

Introductions

About Me

My dog Rosie and I
  • I’m Chiraag Gohel, a 4th Year PhD student in the Department of Biostatistics and Bioinformatics at George Washington University.
  • My interests are in the modeling of high-dimensional ’omics data
  • I play a lot of squash

Syllabus

What is statistical genetics?

Population Genetics

Genetic Epidemiology

Quantitative Genetics

A statistical geneticist may want to know

  1. Is there a genetic component contributing to the total variance of these traits?
  2. Is the genetic component of the traits driven by a few genes located on a particular chromosome, or are there many genes scattered across many chromosomes? How many genes are involved and is this a scientifically sensible question?
  3. Are the genes detected protein-coding genes, or are there also noncoding genes involved in gene regulation?
  4. How is the strength of the signals captured in a statistical analysis related to the two types of genes? What fraction of the total genetic variation is attributable to each type of gene?
  5. What are the frequencies of the genes in the sample? Are the frequencies associated with the magnitude of their effects on the traits?
  6. What is the mode of action of the genes?

A statistical geneticist may want to know

  1. What proportion of the genetic variance estimated in question 1 of the previous list can be explained by the discovered genes?
  2. Given the information on the set of genes carried by an individual, will a genetic score constructed before observing the trait help with early diagnosis and prevention?
  3. How should the predictive ability of the score be measured?
  4. Are there other non-genetic factors that affect the traits, such as smoking behavior, alcohol consumption, blood pressure measurements, body mass index and level of physical exercise?
  5. Could the predictive ability of the genetic score be improved by incorporation of these non-genetic sources of information, either additively or considering interactions? What is the relative contribution from the different sources of information?

What does the data look like?

Family/Pedigree Studies

What does the data look like?

Genome-Wide Association Studies (GWAS)

Genome-Wide Association Studies

Core abstractions for quantitative biology

  1. Model-Building
    • Formulate generative models linking genotype, environment, and phenotype (e.g., linear mixed, Bayesian hierarchical, non-parametric kernels).
    • Encode biological structure: linkage & LD, population stratification, dominance/epistasis, multi-omics priors.
    • Balance realism and tractability to enable scalable computation on genome-scale data.

Core abstractions for quantitative biology

  2. Inference
    • Estimate unknown parameters and latent effects via likelihood maximisation, EM, MCMC, or SGD.
    • Quantify uncertainty with standard errors, posterior intervals, and credible sets.
    • Control false discoveries across millions of tests with FDR/Q-value, permutation, and empirical-Bayes shrinkage.
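
As one concrete instance of FDR control, here is a minimal sketch of the Benjamini-Hochberg step-up procedure in plain Python. The function name `benjamini_hochberg` and the p-values are made up for illustration and do not come from the lecture; a real analysis would normally use an established statistics library.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, alpha=0.05)
print(sum(example_rejected), "of", len(example_p), "hypotheses rejected")
```

Because the rule looks for the largest qualifying rank, a p-value can be rejected even if it misses its own threshold, as long as a larger p-value meets its threshold.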

Core abstractions for quantitative biology

  3. Prediction
    • Use fitted models for out-of-sample trait prediction: BLUP/GBLUP, ridge/lasso/elastic-net, Bayesian whole-genome regressions, random forests, neural nets.
    • Evaluate accuracy (MSE, AUC), bias–variance trade-off, calibration, and portability across ancestries or cell types.
    • Translate genomic predictions into actionable scores for breeding, risk stratification, and drug-target prioritisation.

Core abstractions for quantitative biology

  4. Interpretation & Validation
    • Integrate functional annotations, eQTL, and single-cell data to refine biological mechanisms.
    • Perform replication, cross-cohort meta-analysis, and sensitivity analyses to population assumptions.
    • Communicate findings with clear visualisations and reproducible workflows (R/Bioconductor, Git, notebooks).

 

 

Some vocabulary

  • Trait/Phenotype: A measurable characteristic of an organism, such as height, weight, or disease status.
  • Genotype: The genetic constitution of an individual, often represented by specific alleles at particular loci.
  • Allele: A variant form of a gene that can exist at a specific locus on a chromosome.
  • Locus: A specific, fixed position on a chromosome where a particular gene or genetic marker is located.
  • Polymorphism: The occurrence of two or more genetically determined forms in a population, such as single nucleotide polymorphisms (SNPs) or copy number variations (CNVs).
  • Genetic Marker: A specific DNA sequence with a known location on a chromosome that can be used to identify individuals or species, often used in genetic mapping or association studies.
  • Linkage Disequilibrium (LD): The non-random association of alleles at different loci, indicating that certain allele combinations occur together more frequently than expected by chance.
  • Heritability: The proportion of phenotypic variance in a trait that can be attributed to genetic variance, often estimated through twin or family studies.
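
To make the LD definition concrete, the sketch below computes two common summaries from haplotype counts at two biallelic loci (A/a and B/b): the coefficient D = p_AB - p_A p_B and the squared correlation r². The function name `ld_statistics` and the counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r_squared) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total              # frequency of the AB haplotype
    p_A = (n_AB + n_Ab) / total      # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total      # frequency of allele B at the second locus
    # D is the excess of AB haplotypes over what independence would predict.
    d = p_AB - p_A * p_B
    # r^2 rescales D^2 by the allele-frequency variances at the two loci.
    r_squared = d ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, r_squared


# Hypothetical haplotype counts, for illustration only.
d_example, r2_example = ld_statistics(n_AB=50, n_Ab=10, n_aB=10, n_ab=30)
print(round(d_example, 4), round(r2_example, 4))
```

When the alleles at the two loci are independent, p_AB equals p_A p_B and both D and r² are zero.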

Some vocabulary

  • Genome-Wide Association Study (GWAS): A study that looks for associations between genetic variants across the genome and specific traits or diseases in a large population.
  • Polygenic Score (PGS): A score that aggregates the effects of multiple genetic variants to predict an individual’s genetic predisposition to a trait or disease.
  • Quantitative Trait Locus (QTL): A region of the genome that is associated with a quantitative trait, often identified through linkage or association mapping.
  • Epistasis: The interaction between genes where the effect of one gene is modified by one or more other genes, influencing the expression of a trait.
  • Epigenetics: The study of heritable changes in gene expression that do not involve changes to the underlying DNA sequence, often influenced by environmental factors.
  • Functional Annotation: The process of identifying the functional elements in the genome, such as genes, regulatory regions, and non-coding RNAs, and understanding their roles in biological processes.
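
In its simplest additive form, a polygenic score is a weighted sum of allele dosages, with weights given by per-variant effect-size estimates. The sketch below assumes dosages coded as 0, 1, or 2 copies of the effect allele; the function name `polygenic_score`, the dosages, and the effect sizes are all made up for illustration.

```python
def polygenic_score(dosages, effect_sizes):
    """Additive score: sum over variants of effect size times effect-allele dosage."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("dosages and effect_sizes must have the same length")
    return sum(beta * g for g, beta in zip(dosages, effect_sizes))


# Hypothetical individual genotyped at four variants (illustrative values only).
example_dosages = [0, 1, 2, 1]
example_effects = [0.10, -0.05, 0.20, 0.00]
example_score = polygenic_score(example_dosages, example_effects)
print(round(example_score, 2))
```

Real scores also involve choices this sketch ignores, such as which variants to include and how to adjust effect sizes for LD.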

Early work in the field

Genetics, statistics, and eugenics have always been closely linked

Standard eugenics scheme of descent

Francis Galton and his work, “Inquiries into Human Faculty and Its Development”

The ethics of genetics and genomics was, and still is, of great importance!

Parameter Estimation

  • Imagine a geneticist is studying allele ages (i.e., how many generations ago an allele arose via mutation) under a simplified model where new mutations arise randomly and uniformly over a fixed window — say, the last \(\theta\) generations.
  • We could model each allele age as \(X_i \sim \textsf{Uniform}(0, \theta)\), independently across alleles.
  • We have our observed sample \(\{x_1,x_2, \ldots, x_n\}\), ages in generations of \(n\) observed alleles.
    • For concreteness, let our sample be \(\{0, 1, 8, 4, 3, 1, 1, 5\}\).
  • How should we estimate \(\theta\)?
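
One way to explore the question is to compare a few standard estimators on the sample above. The sketch assumes the ages are independent draws from a continuous Uniform(0, θ) distribution (the observed ages are integers, so this is an approximation); the variable names are illustrative.

```python
ages = [0, 1, 8, 4, 3, 1, 1, 5]
n = len(ages)

# Method of moments: E[X] = theta / 2, so set theta_hat = 2 * sample mean.
theta_mom = 2 * sum(ages) / n

# Maximum likelihood: the likelihood theta^(-n) is maximised at the smallest
# theta that is still compatible with the data, i.e. the sample maximum.
theta_mle = max(ages)

# The sample maximum underestimates theta on average (E[max] = n / (n + 1) * theta),
# so rescaling by (n + 1) / n gives an unbiased estimator under the continuous model.
theta_unbiased = (n + 1) / n * max(ages)

print(theta_mom)       # 5.75
print(theta_mle)       # 8
print(theta_unbiased)  # 9.0
```

Note that the method-of-moments estimate (5.75) is smaller than the largest observed age (8), so it is incompatible with the data under this model. This is one reason the bias and variance of estimators deserve careful study.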

Importance of statistical theory

  • Understanding random variables, sampling distributions, and the bias and variance of estimators helps us pose biological questions and understand how they can be answered
  • Let \(y\) be the expression of a trait, \(G\) the additive contribution of genetic variables, and \(E\) an environmental value
  • What assumptions do we make via the following models?

\[\begin{align} y &= G + E \\ y &= \beta_0 + \beta_1 G + \beta_2 E + \epsilon \\ y &= \beta_0 + \beta_1 G + \beta_2 E + \epsilon, \quad \epsilon \sim N(0, \sigma^2) \\ y &= \beta_0 + \beta_1 G + \beta_2 E + \beta_3 \left(G \times E\right) + \epsilon, \quad \epsilon \sim N(0, \sigma^2) \end{align}\]
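
As a small illustration of what the first model implies, the simulation below draws \(G\) and \(E\) independently and checks how the variance of \(y = G + E\) splits into components. Independence of \(G\) and \(E\), the normal distributions, and the variances 0.6 and 0.4 are all assumptions made for this sketch, not values from the lecture.

```python
import random

random.seed(8878)  # fixed seed so the sketch is reproducible
n_sim = 20000
var_g_true, var_e_true = 0.6, 0.4  # hypothetical variance components

# Draw G and E independently, then form the trait y = G + E.
G = [random.gauss(0, var_g_true ** 0.5) for _ in range(n_sim)]
E = [random.gauss(0, var_e_true ** 0.5) for _ in range(n_sim)]
y = [g + e for g, e in zip(G, E)]


def sample_cov(u, v):
    """Unbiased sample covariance; sample_cov(u, u) is the sample variance."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (len(u) - 1)


var_y = sample_cov(y, y)
var_g = sample_cov(G, G)
var_e = sample_cov(E, E)
cov_ge = sample_cov(G, E)

# Var(y) = Var(G) + Var(E) + 2 Cov(G, E); the covariance term is near zero
# here only because G and E were simulated independently.
genetic_share = var_g / var_y
print(round(var_y, 3), round(cov_ge, 3), round(genetic_share, 3))
```

If \(G\) and \(E\) were correlated, the covariance term would not vanish and the ratio Var(G)/Var(y) would no longer be a clean "genetic share" of the trait variance.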

 

Molecular Genetics

Genetic Variants

The sampling distribution of a random variable

  • A random variable is a function that assigns a numerical value to each outcome in a sample space.
  • A probability distribution describes how the probabilities are distributed over the values of the random variable.
  • The likelihood function is a function of the parameters of a statistical model given specific observed data.

Modeling

  • Imagine that we have a sample of \(n\) unrelated haploid individuals from some population
  • We want to estimate allele frequencies for a biallelic SNP, say A/a
  • In our sample, we observe \(x\) individuals with allele A and \(n - x\) individuals with allele a
  • Let \(p\) be the frequency of allele A and \(q = 1 - p\) be the frequency of allele a

\[Pr(X = x| n, p) = \binom{n}{x} p^x q^{n - x}\]

  • Let's say we observe \(n = 27\) and \(x = 11\). How do we estimate \(p\)?
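
One natural answer is maximum likelihood. Setting the derivative of the binomial log-likelihood to zero gives \(\hat{p} = x / n\). The sketch below checks that closed form against a brute-force search over a grid of candidate values; the function name `binom_log_likelihood` and the grid resolution are illustrative choices.

```python
import math


def binom_log_likelihood(p, n, x):
    """Log of Pr(X = x | n, p) for the binomial model, valid for 0 < p < 1."""
    return math.log(math.comb(n, x)) + x * math.log(p) + (n - x) * math.log(1 - p)


n_obs, x_obs = 27, 11

# Closed-form maximum likelihood estimate.
p_hat = x_obs / n_obs

# Numerical check: evaluate the log-likelihood on a grid over (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: binom_log_likelihood(p, n_obs, x_obs))

print(round(p_hat, 4), p_grid)
```

The grid maximiser agrees with \(x / n \approx 0.407\) up to the grid spacing.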

Modeling, but make it Bayesian

  • Let’s say we have run a previous experiment in which we sampled \(n = 20\) individuals and observed \(x = 3\) individuals with allele A
  • We can use this information to inform our prior distribution for \(p\)
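
One common way to do this is with a Beta prior, which is conjugate to the binomial likelihood. The sketch below reads the previous experiment as 3 carriers of allele A among 20 sampled individuals, starts from a flat Beta(1, 1) (an assumption of this sketch, not something the lecture specifies), and then updates with the current data of 11 carriers among 27. The function name `beta_update` is illustrative.

```python
def beta_update(a, b, successes, trials):
    """Conjugate update: Beta(a, b) prior plus binomial data gives Beta(a + x, b + n - x)."""
    return a + successes, b + (trials - successes)


# Assumed starting point: a flat Beta(1, 1) before any data are seen.
a0, b0 = 1.0, 1.0

# Prior for p informed by the previous experiment (3 of 20 carried allele A).
a_prior, b_prior = beta_update(a0, b0, successes=3, trials=20)

# Posterior after the current experiment (11 of 27 carried allele A).
a_post, b_post = beta_update(a_prior, b_prior, successes=11, trials=27)

prior_mean = a_prior / (a_prior + b_prior)
post_mean = a_post / (a_post + b_post)
print((a_prior, b_prior), (a_post, b_post))
print(round(prior_mean, 3), round(post_mean, 3))
```

Under these assumptions the posterior mean falls between the prior mean (about 0.18) and the maximum likelihood estimate from the current data (about 0.41), reflecting a compromise between the two experiments.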